Sequence variants affecting the genome-wide rate of germline microsatellite mutations

Microsatellites are polymorphic tracts of short tandem repeats with one to six base-pair (bp) motifs and are some of the most polymorphic variants in the genome. Using 6084 Icelandic parent-offspring trios we estimate 63.7 (95% CI: 61.9–65.4) microsatellite de novo mutations (mDNMs) per offspring per generation; excluding one-bp repeat motifs (homopolymers), the estimate is 48.2 mDNMs (95% CI: 46.7–49.6). Paternal mDNMs occur at longer repeats than maternal ones, which are in turn larger, with a mean size of 3.4 bp vs 3.1 bp for paternal ones. mDNMs increase by 0.97 (95% CI: 0.90–1.04) and 0.31 (95% CI: 0.25–0.37) per year of father’s and mother’s age at conception, respectively. Here, we find two independent coding variants that associate with the number of mDNMs transmitted to offspring. The minor allele of a missense variant (allele frequency (AF) = 1.9%) in MSH2, a mismatch repair gene, increases transmitted mDNMs from both parents (effect: 13.1 paternal and 7.8 maternal mDNMs). A synonymous variant (AF = 20.3%) in NEIL2, a DNA damage repair gene, increases paternally transmitted mDNMs (effect: 4.4 mDNMs). Thus, the microsatellite mutation rate in humans is in part under genetic control.

The fraction of G/C bases in an STR's motif (motif GC content) is negatively correlated with polymorphism rate (Supplementary Table 5) while negative correlation with expected heterozygosity is only observed for motif lengths above two bp (Supplementary Table 4).
Homopolymers from the C motif class have higher expected heterozygosity than A class homopolymers (Supplementary Table 4) but account for only 0.8% of all homopolymers (Supplementary Table 24). CpG microsatellites (CG motif class) also have on average higher expected heterozygosity than the other dinucleotide motif classes but account for only 0.4% of all dinucleotide repeats (Supplementary Table 24, Supplementary Table 4). Thus, although C class homopolymers and CG class dinucleotide microsatellites have higher expected heterozygosity values than other classes with equal motif lengths, their overall effect on microsatellite diversity is small since they are so rare.
Enrichment of A motif class homopolymers in the human genome is thought to be a result of the microsatellite-like structure often found at 3' ends of reverse transcribed RNA sequences, i.e. poly-A tails 2,3 . The high expected heterozygosity at CG class microsatellites is consistent with CpG sites acting as mutational hot spots 4 . The negative correlation of GC content with polymorphism rate and expected heterozygosity can likely be explained by the three hydrogen bonds between paired G/C bases, compared to two between A/T bases, which make the slippage-causing dissociation from the template strand during replication less likely.
We define repeat purity as the ratio between the number of times the STR's repeat motif is observed in its RRT and the maximum number of repeat motifs if the sequence contained no interruptions (where the highest possible repeat purity is 1). Stratified on RRT length, all correlations between repeat purity and both expected heterozygosity and polymorphism rate were positive (Supplementary Table 2
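The repeat purity definition above can be illustrated with a minimal sketch. It assumes that motif copies are counted as non-overlapping, left-to-right matches of a single motif string and that the maximum number of copies is the tract length divided by the motif length, rounded down. The function name `repeat_purity` and the example sequences are hypothetical and are not taken from the study's pipeline.

```python
def repeat_purity(rrt: str, motif: str) -> float:
    """Observed motif copies divided by the maximum possible copies in the tract.

    Assumption: copies are counted as non-overlapping matches of `motif`,
    and the maximum is len(rrt) // len(motif).
    """
    if not motif or len(rrt) < len(motif):
        raise ValueError("tract must be at least one motif long")
    max_copies = len(rrt) // len(motif)
    observed_copies = rrt.count(motif)  # str.count counts non-overlapping matches
    return observed_copies / max_copies


# A perfect tract has purity 1; a single interruption lowers it.
pure = repeat_purity("ACACACAC", "AC")
interrupted = repeat_purity("ACACATACAC", "AC")
print(pure, interrupted)
```

An interruption that replaces one motif copy removes exactly one observed copy, so purity falls in steps of one over the maximum copy number.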

Supplementary note 2: False positive rate estimation
To estimate the false positive rate of our mDNM detection we used three methods: PacBio CCS sequence data, available for four of our trios; mDNM sharing between nine monozygotic twin pairs; and haplotype sharing across three-generation families. We used haplotype-resolved assemblies of the PacBio data. We were unable to verify homopolymer mDNMs because the PacBio sequencing error rate was too high 10 , but we were able to assess 27 mDNMs with motif length >1. Out of these, 26 were true positives and one was a false positive at a dinucleotide repeat, giving an expected false positive rate of 3.7% for motif lengths greater than 1 (Supplementary Table 7).
For mDNMs observed in offspring with a monozygotic twin also present in our set, we checked whether the mDNM genotypes were concordant between the twins. We compared only mDNM calls for which a genotype was present in the offspring's monozygotic twin, treating a genotype call in the other twin as present if its genotype quality was at least 30, which is half of the value we require for trio mDNM detection. Out of the 230 comparable MZ twin mDNMs, 217 were found in both twins and 13 were discordant (Supplementary Table 8), which gives a false positive rate estimate of 5.6%. We note that this is likely to be an overestimate, as some of the differences between the twin pairs could be the result of postzygotic mutations, representing true differences between twins 11 .
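The two count-based estimates (PacBio validation and monozygotic twin concordance) are simple proportions of failed calls among evaluated calls. The sketch below reproduces that arithmetic from the counts quoted in the text. The helper name `false_positive_rate` is hypothetical, and the sketch makes no claim about how the authors rounded their reported percentages.

```python
def false_positive_rate(n_false_positive: int, n_evaluated: int) -> float:
    """Fraction of evaluated mDNM calls that failed validation."""
    if n_evaluated <= 0:
        raise ValueError("n_evaluated must be positive")
    return n_false_positive / n_evaluated


# Counts quoted in the text.
pacbio_fpr = false_positive_rate(1, 27)   # PacBio: 1 of 27 calls not confirmed
twin_fpr = false_positive_rate(13, 230)   # MZ twins: 13 of 230 calls discordant

print(f"PacBio estimate:  {pacbio_fpr:.4f}")
print(f"MZ twin estimate: {twin_fpr:.4f}")
```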
Using haplotype sharing across 540 three-generation families (795 trios), we counted how many times an mDNM was transmitted from an offspring to its child and estimated the transmission rate. The expected value of the transmission rate is 0.50 and deviations from it quantify false positive mDNM detection rates. For example, if the observed mutations were somatic and thus false positive as mDNMs, we would not observe transmission from the offspring to its child. We observe a transmission rate of 0.49 (N = 11,228, 95% CI: 0.48-0.50) which gives an estimated false positive rate of 2%, although transmission rates vary between motif lengths and thus the error rate estimates as well (Supplementary Table 9). Notably, the transmission rate for homopolymers is only 0.4 while the other motif lengths have transmission rates much closer or equal to 0.5.
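The step from a transmission rate to a false positive rate can be sketched as follows. The sketch assumes the false positive rate is taken as one minus the ratio of observed to expected transmission, which reproduces the 2% figure for an observed rate of 0.49. The function name `fpr_from_transmission` is hypothetical, and the formula is an inference from the text rather than a statement of the authors' exact procedure.

```python
def fpr_from_transmission(observed_rate: float, expected_rate: float = 0.5) -> float:
    """False positive rate implied by a shortfall in transmission.

    Assumption: true germline mDNMs are transmitted at `expected_rate`,
    false positives are never transmitted, so
    observed = (1 - fpr) * expected.
    """
    if not 0.0 < expected_rate <= 1.0:
        raise ValueError("expected_rate must be in (0, 1]")
    return 1.0 - observed_rate / expected_rate


overall_fpr = fpr_from_transmission(0.49)      # all motif lengths
homopolymer_fpr = fpr_from_transmission(0.4)   # homopolymers only

print(f"overall:      {overall_fpr:.3f}")
print(f"homopolymers: {homopolymer_fpr:.3f}")
```

Under the same assumption, the homopolymer transmission rate of 0.4 would correspond to a considerably higher false positive rate than the overall estimate, consistent with the note that error rates vary by motif length.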

Supplementary note 3: mDNM rate estimate comparison
Our mDNM rate estimate is nominally lower than a previous estimate 12 of 5.6 · 10⁻⁵ and lower than the two estimates of 10.0 · 10⁻⁴ and 2.7 · 10⁻⁴ for tetra- and dinucleotide repeats, respectively 13 . This apparent discrepancy could be a result of more conservative filtering in our study, a younger set of parents, a generally healthier cohort and a different range of motif lengths considered. However, the most likely reason for the apparent discrepancy is the sample size difference between the studies. Our set contains 53,026 individuals while the set analyzed by Mitra et al. 12 contained 6,548 individuals. Thus, our minimum detection frequency is 1/(2 · 53,026) = 9.0 · 10⁻⁶ compared to the minimum detection frequency 12 of 1/(2 · 6,548) = 7.6 · 10⁻⁵ enforced by the smaller sample size. We recomputed our mutation rate estimate conditioning on microsatellite frequency (Supplementary Table 25) and confirmed that at a detection frequency cutoff of 7.6 · 10⁻⁵ our estimate becomes 5.6 · 10⁻⁵ and matches the one presented by Mitra et al. 12 . Similarly, at a minimum frequency of 10% our estimate is comparable to the one from Sun et al. 13 . Based on this we conclude that an mDNM rate estimate depends on the size of the sample set studied.
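The minimum detection frequency used in this comparison is one allele among the 2N chromosomes of N diploid individuals. The sketch below evaluates that expression for both cohort sizes; the function name `min_detection_frequency` is hypothetical. Direct evaluation gives approximately 9.4 · 10⁻⁶ for N = 53,026 and 7.6 · 10⁻⁵ for N = 6,548, so the first figure quoted in the text may reflect rounding or a slightly different denominator.

```python
def min_detection_frequency(n_individuals: int) -> float:
    """Lowest non-zero allele frequency observable among 2N chromosomes."""
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    return 1.0 / (2 * n_individuals)


this_study = min_detection_frequency(53_026)
mitra_et_al = min_detection_frequency(6_548)

print(f"this study:   {this_study:.3e}")
print(f"Mitra et al.: {mitra_et_al:.3e}")
print(f"ratio:        {mitra_et_al / this_study:.1f}")
```

The ratio of the two frequencies equals the ratio of the cohort sizes, which is why a larger sample can include rarer, and typically less mutable, microsatellites in the denominator of the rate estimate.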

Supplementary note 4: mDNM rate comparison between motif equivalence classes
The mDNM rate is higher for C class homopolymers than for A class ones (Mann-Whitney U test P < 1 · 10⁻²³⁰, Fig. 2), but C homopolymers are much rarer and represent only 0.8% of all homopolymers in our set (Supplementary Table 24). The AC motif class has the highest mDNM rate of the dinucleotide microsatellites (Supplementary Table 26, Fig. 2). However, the average RRT length of the AC motif class is the longest among dinucleotide classes. With RRT length included as a covariate, the CG motif class has a higher mDNM rate than all other dinucleotide classes (Supplementary Table 27), in line with the fact that CpG sites have been shown to act as mutational hot spots 4 .
The AAT motif class had a higher mDNM rate than eight of the other nine trinucleotide repeat motif classes. Only the rarest class (ACG) did not show a significantly different mDNM rate (Supplementary Table 28). The AAT motif class accounts for 38.4% of all trinucleotide microsatellites and 79.0% of trinucleotide mDNMs and has an mDNM rate 1.5 times higher than the second highest class (Supplementary Table 28). A higher mDNM rate for AAT class motifs has been previously reported for other organisms 14,15 but not, to our knowledge, for humans.
An in vitro study of how repeat motifs affect the frequency of polymerase slippage during replication reported that motifs less likely to stall replication were more likely to mutate during replication. Of the dinucleotide repeat classes, microsatellites from the AC class had the lowest replication stall affinity 16 . The higher mDNM rate of microsatellites in the AAT motif class could be a result of its low replication stall affinity 16 . The two hydrogen bonds between A/T base-pairs, compared to the three between G/C base-pairs, also make A/T pairs more likely to dissociate from each other, enabling the formation of secondary structures and possible mDNMs. Finally, repeats with a high A/T-content also have a sequence composition similar to elements involved in DNA unwinding at replication origins 17 during mitosis. These repeats could therefore function as aberrant replication origins and cause a higher mDNM rate during replication in S phase 17 .

Supplementary note 5: mDNMs in functionally annotated and early replicating regions
Previous studies have reported increased efficiency of mismatch repair (MMR) in early-replicating regions of the human genome 18 . Our results are in line with this, since we see a 1.28-fold (95% CI: 1.25-1.31) depletion of mDNMs in early-replicating regions of the genome 19 .
Exonic mDNMs are rarer than their intergenic and intronic counterparts. In 2,568,858 transmissions of microsatellites intersecting exons by one or more bp, we observed 33 mDNMs. We estimated the exonic mDNM rate as 1.3 · 10⁻⁵ MMG, which is 3.9 (95% CI: 2.8-5.6) times lower than the genome-wide estimate. mDNMs are further 1.7-fold (95% CI: 1.3-2.1), 1.4-fold (95% CI: 1.3-1.5) and 4.2-fold (95% CI: 1.2-34.4) depleted in 5'UTRs, 3'UTRs and splice regions, respectively. The 33 exonic mDNMs occurred at 21 unique microsatellites, of which 19 had motif lengths that were multiples of three. Since amino acids are coded by three-bp codons, mutations at microsatellites with motif lengths that are multiples of three are unlikely to cause a frameshift and instead produce an in-frame alteration of a gene. Of these 21 microsatellites, sixteen were trinucleotide microsatellites, three were hexanucleotide microsatellites and the remaining two were homopolymers.
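The exonic rate and the reading-frame argument can be checked with a short sketch that uses only the counts quoted in the text. The helper `preserves_reading_frame` is hypothetical; it assumes that a mutation adds or removes whole motif copies, so the indel length is the motif length times the change in copy number.

```python
# Counts quoted in the text.
exonic_mdnms = 33
exonic_transmissions = 2_568_858
exonic_rate = exonic_mdnms / exonic_transmissions  # mutations per transmission

# Motif length -> number of the 21 mutated exonic microsatellites.
mutated_exonic_sites = {3: 16, 6: 3, 1: 2}
in_frame_sites = sum(n for motif_len, n in mutated_exonic_sites.items()
                     if motif_len % 3 == 0)


def preserves_reading_frame(motif_length: int, copy_change: int) -> bool:
    """True if gaining or losing `copy_change` motif copies keeps the frame.

    Assumption: the indel length equals motif_length * copy_change.
    """
    return (motif_length * copy_change) % 3 == 0


print(f"exonic mDNM rate: {exonic_rate:.2e}")
print(f"sites with motif length divisible by three: {in_frame_sites} of "
      f"{sum(mutated_exonic_sites.values())}")
print(preserves_reading_frame(3, 1), preserves_reading_frame(1, 1))
```

Any step change at a tri- or hexanucleotide repeat keeps the frame, whereas a homopolymer keeps it only when the copy number changes by a multiple of three.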
Tri- and hexanucleotide repeats were enriched in coding exons (chi-squared test P < 1 · 10⁻³²⁰) compared with the rest of the genome. Microsatellites with motif lengths that are multiples of three accounted for 93.3% of exon-intersecting microsatellites and 70.9% of all exon-intersecting non-polymorphic STRs (Supplementary Table 29). In contrast, 12.3% of all microsatellites and 44.5% of non-polymorphic STRs had motif lengths that are multiples of three (Supplementary Table 29).
The average purity of microsatellites was 0.94, while among microsatellites in exons the purity was notably lower (0.87, Mann-Whitney P = 1 · 10⁻¹⁵³). Purity was positively correlated with the mDNM rate, so decreased purity in exons may decrease occurrences of possibly pathogenic mDNMs. This indicates possible positive selection for point mutations that reduce the purity of exonic microsatellites, or possible negative selection against point mutations that increase their purity. The purity difference is largest for trinucleotide repeats, the most common motif length for exon-intersecting microsatellites (Supplementary Table 30). Non-polymorphic coding STRs do not have decreased purity compared to their intergenic counterparts, so point mutations are less likely to be the mechanism preventing mutations at these exonic STRs (Supplementary Table 30).

Supplementary note 6: Replication of mutation rate trends
As the RRT length increased from ten to 100 bp the mDNM rate also increased for all motif lengths (Fig. 2), consistent with findings from previous studies 20–23 . Intuitively, the opportunities for errors during replication increase with a microsatellite's RRT length.
Microsatellites with longer RRTs have been shown to be more likely to contract and shorter ones to expand 23–29 . We replicate this. For all RRT length thresholds, a higher fraction of mDNMs had a gain of repeat motifs below the threshold than above it, i.e., the microsatellite group with shorter overall RRTs was more likely to expand than the group with longer RRTs (Supplementary Table 31).
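The threshold comparison described above can be sketched as follows. The records are invented toy values for illustration only and are not data from the study; the function name `expansion_fractions` and the representation of each mDNM as a pair of RRT length and change in repeat count are assumptions.

```python
def expansion_fractions(mdnms, threshold):
    """Fraction of mDNMs that gain repeats, below and at or above an RRT threshold.

    `mdnms` is an iterable of (rrt_length_bp, repeat_change) pairs, where a
    positive repeat_change is an expansion and a negative one a contraction.
    """
    below = [change for length, change in mdnms if length < threshold]
    above = [change for length, change in mdnms if length >= threshold]

    def gain_fraction(changes):
        if not changes:
            return float("nan")
        return sum(change > 0 for change in changes) / len(changes)

    return gain_fraction(below), gain_fraction(above)


# Hypothetical toy records: (RRT length in bp, change in repeat count).
toy_mdnms = [(12, 1), (15, 1), (18, -1), (40, -1), (55, -2), (60, 1)]
frac_below, frac_above = expansion_fractions(toy_mdnms, threshold=30)
print(frac_below, frac_above)
```

Repeating the call over a range of thresholds and checking that the first fraction exceeds the second at each one mirrors the comparison reported in the text.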
The mDNM rate for homopolymers was positively correlated with the motif G/C content, while for di-, tri-, tetra- and pentanucleotide repeats the correlation was negative; for hexanucleotide repeats we lacked power to detect a correlation with the mDNM rate (Supplementary Table 32). mDNM rates varied by repeat motif class: AAT motif mDNMs were by far the most common ones among trinucleotides (Supplementary Table 28). The rare C homopolymers mutated more frequently than the A homopolymers. AC motif microsatellites had the highest mutation rate of the dinucleotide microsatellites before correcting for RRT length (Supplementary Table 26), and CG motif microsatellites after correction (Fig. 2

Supplementary note 7: Motif length enrichment excluding homopolymers
A higher fraction of maternal mDNMs occur at tri-, penta- and hexanucleotide microsatellites while dinucleotide microsatellites represent a larger fraction of paternal mDNMs (Supplementary Table 14, Fig. 3).
Excluding homopolymers, the average number of bp involved is larger in maternal mDNMs than in paternal mDNMs (3.9 vs 3.4 bp, Mann-Whitney U test P = 2.6 · 10⁻²³). Stratifying on motif length reveals that maternal mDNMs affect more bp on average at di- and tetranucleotide microsatellites (Table 2).
Considering mDNMs with motif lengths above one, the number of repeats in the reference is higher for paternal mDNMs (16.8 vs 15.7 repeats, Mann-Whitney U test P = 2.8 · 10⁻⁵¹).

Supplementary note 8: Motif length fraction change with age including homopolymers
The fraction of tetranucleotide mDNMs increases with paternal age (linear regression P = 1.5 · 10⁻⁶), and the fraction of di- and hexanucleotide mDNMs increases with maternal age.
For both maternal and paternal mDNMs the fraction of mDNMs at homopolymers decreases with age.

Supplementary note 9: Mismatch repair efficiency effect of rs4987188
Studies of G317D, the yeast homolog of rs4987188, conclude that it does not affect MSH2 expression levels; rather, protein products of the mutated allele have a decreased mismatch repair efficiency relative to the wild-type allele and need to be expressed at higher levels to be equivalent to it 30,31 . The first experiment was a direct comparison of mismatch repair rates for the wild-type allele and G317D, resulting in a significant 1.7 MMR defect. The second experiment compared how well G317D could complement an msh2Δ-null mutant when expressed from the native MSH2 promoter and when expressed at higher levels from a GAL10 promoter. When expressed from the GAL10 promoter, the yeast G317D allele partially complemented the msh2Δ-null mutant, whereas no complementation was observed when it was expressed from the native MSH2 promoter. Combined, these results suggest that G317D affects MMR efficiency and that, in vivo, the function of this variant could change with levels of expression. To show how this supports our results, we generated a boxplot of the RNA expression of MSH2 for each rs4987188 genotype, (0/0), (0/1) and (1/1), and show that, in our data, expression levels do not differ between genotypes (Supplementary Figure 8). We thus conclude that the MMR efficiency in carriers is decreased relative to non-carriers, and therefore their mutational load should be increased.

Supplementary Figures
Supplementary Figure 1 | Expected heterozygosity vs number of repeats in UKB. Average expected heterozygosity in UKB as a function of repeat number stratified on motif length with error bars representing 95% confidence intervals. The drop in all motif lengths is most likely due to our inability to reliably detect long alleles from short reads, causing underestimation of expected heterozygosity values at microsatellites with long reference alleles (n1bp=753,664; n2bp=330,797; n3bp=203,673; n4bp=442,153; n5bp=322,598; n6bp=338,919). Supplementary